In this course we will learn how to visualise tabular datasets with ggplot2.
We will use several packages that contain the functions we need:
- tidyverse, a collection of packages that includes ggplot2
- visdat, which gives a quick visual overview of a dataset
- plotly, for making interactive graphs
You should already have installed these packages. To check that they are installed, and to load them into your session, use the library() function:
library(tidyverse)
library(visdat)
library(plotly)
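If a library() call fails with a "there is no package called ..." error, that package is not installed yet. As a minimal sketch (the variable names needed, installed and missing_pkgs are only illustrative), the same check can be done for all three packages at once:

```r
# Packages used in this lesson
needed <- c("tidyverse", "visdat", "plotly")

# requireNamespace() returns TRUE when a package is installed,
# without attaching it to the session
installed <- vapply(needed, requireNamespace, logical(1), quietly = TRUE)

# Names of the packages that still need install.packages()
missing_pkgs <- needed[!installed]
missing_pkgs
```

An empty result means everything is in place; otherwise the listed names can be passed to install.packages().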
You also need to make sure that R's working directory is the folder that contains the lesson material. The path to this folder will differ depending on your operating system.
setwd("~/Desktop/ggplot_course/materiel")
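To confirm where R is currently working, the following sketch may help. It uses only base R functions; the folder names are the ones from the example path in this lesson and will probably differ on your machine.

```r
# Show the current working directory
getwd()

# List what R can see there; the lesson material should appear in this listing
list.files()

# file.path() joins folder names into a single path.
# These folder names are the ones used in this lesson; adapt them to your machine.
course_dir <- file.path("~", "Desktop", "ggplot_course", "materiel")

# TRUE if the folder exists on this computer
dir.exists(course_dir)
```

If dir.exists() returns FALSE, the path passed to setwd() needs adjusting before the data can be read.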
For this lesson, we will use a slightly modified version of the dataset published by Burghardt et al. 2015.
The simplified version of the data is in the data folder (burghardt_et_al_2015_expt1.txt). It contains phenotypes related to the time that plants of different genotypes take to flower under different conditions.
As our working directory is the materiel folder, we read the data like this:
# Read the data and store it in the variable (object) expt1
expt1 <- read_tsv("../data/burghardt_et_al_2015_expt1.txt")
## Parsed with column specification:
## cols(
## genotype = col_character(),
## background = col_character(),
## temperature = col_double(),
## fluctuation = col_character(),
## day.length = col_double(),
## vernalization = col_character(),
## survival.bolt = col_character(),
## bolt = col_character(),
## days.to.bolt = col_double(),
## days.to.flower = col_double(),
## rosette.leaf.num = col_double(),
## cauline.leaf.num = col_double(),
## blade.length.mm = col_double(),
## total.leaf.length.mm = col_double(),
## blade.ratio = col_double()
## )
The read_tsv() function prints a message indicating what type of data each column of the file contains.
In our case, some columns contain “character” data (text) and the others contain numeric data. Here every numeric column is read as “double”, even those holding only whole numbers; R also has an “integer” type for numbers without decimals.
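The difference between these types can be checked directly in the console. The sketch below uses only base R and a small made-up data frame (toy), not the lesson data:

```r
# A whole number typed without a suffix is stored as a double
typeof(12)

# The L suffix asks for an integer instead
typeof(12L)

# Text is stored as character
typeof("Col")

# One type per column of a small, made-up data frame
toy <- data.frame(
  genotype = c("Col", "Ler"),
  temperature = c(12, 22),
  stringsAsFactors = FALSE
)
sapply(toy, typeof)
```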
To take a quick look at the data, type the name of the variable that holds it (expt1).
expt1
## # A tibble: 957 x 15
## genotype background temperature fluctuation day.length vernalization
## <chr> <chr> <dbl> <chr> <dbl> <chr>
## 1 Col Ama Col 12 Con 16 NV
## 2 Col Ama Col 12 Con 16 NV
## 3 Col Ama Col 12 Con 16 NV
## 4 Col Ama Col 12 Con 16 NV
## 5 Col Ama Col 12 Con 16 NV
## 6 Col Ama Col 12 Con 16 NV
## 7 Col Ama Col 12 Con 16 NV
## 8 Col Ama Col 12 Con 16 NV
## 9 Col Ama Col 12 Con 8 NV
## 10 Col Ama Col 12 Con 8 NV
## # … with 947 more rows, and 9 more variables: survival.bolt <chr>, bolt <chr>,
## # days.to.bolt <dbl>, days.to.flower <dbl>, rosette.leaf.num <dbl>,
## # cauline.leaf.num <dbl>, blade.length.mm <dbl>, total.leaf.length.mm <dbl>,
## # blade.ratio <dbl>
This shows the first 10 rows of the table and as many columns as fit on the screen.
Challenge: How many rows and columns are there in the data?
Another option is the View() function, which opens an interactive table where you can sort and filter the data without modifying the variable:
View(expt1)
glimpse() gives an idea of the structure of the data:
glimpse(expt1)
## Observations: 957
## Variables: 15
## $ genotype <chr> "Col Ama", "Col Ama", "Col Ama", "Col Ama", "Col…
## $ background <chr> "Col", "Col", "Col", "Col", "Col", "Col", "Col",…
## $ temperature <dbl> 12, 12, 12, 12, 12, 12, 12, 12, 12, 12, 12, 12, …
## $ fluctuation <chr> "Con", "Con", "Con", "Con", "Con", "Con", "Con",…
## $ day.length <dbl> 16, 16, 16, 16, 16, 16, 16, 16, 8, 8, 8, 8, 8, 8…
## $ vernalization <chr> "NV", "NV", "NV", "NV", "NV", "NV", "NV", "NV", …
## $ survival.bolt <chr> "Y", "Y", "Y", "Y", "Y", "Y", "Y", "Y", "Y", "Y"…
## $ bolt <chr> "Y", "Y", "Y", "Y", "Y", "Y", "Y", "Y", "Y", "Y"…
## $ days.to.bolt <dbl> 28, 29, 31, 31, 32, 33, 34, 35, 69, 72, 76, 79, …
## $ days.to.flower <dbl> 43, 44, 43, 42, 44, 47, 47, 49, 90, 91, 97, 99, …
## $ rosette.leaf.num <dbl> 18, 15, 13, 17, 19, 14, 15, 18, 53, 49, 51, 55, …
## $ cauline.leaf.num <dbl> 6, 5, 4, 5, 4, 4, 3, 5, 6, 5, 6, 9, 6, 9, 8, 10,…
## $ blade.length.mm <dbl> 12.9, 10.5, 13.2, 14.6, 13.3, 14.7, 13.0, 17.8, …
## $ total.leaf.length.mm <dbl> 21.1, 19.1, 23.4, 27.2, 20.4, 25.3, 23.2, 31.3, …
## $ blade.ratio <dbl> 0.6113744, 0.5497382, 0.5641026, 0.5367647, 0.65…
Challenge: What are the types of the variables in the data?
dim() gives the dimensions of the dataset (number of rows and columns).
dim(expt1)
## [1] 957 15
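The two numbers can also be obtained separately. A small illustration on a made-up data frame (toy is not part of the lesson data):

```r
# Made-up data frame with 3 rows and 2 columns
toy <- data.frame(
  genotype = c("Col", "Ler", "Col"),
  days.to.flower = c(43, 51, 44)
)

dim(toy)   # number of rows, then number of columns
nrow(toy)  # rows only
ncol(toy)  # columns only
```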
summary() provides basic statistics for each column.
summary(expt1)
## genotype background temperature fluctuation
## Length:957 Length:957 Min. :12.00 Length:957
## Class :character Class :character 1st Qu.:12.00 Class :character
## Mode :character Mode :character Median :12.00 Mode :character
## Mean :16.98
## 3rd Qu.:22.00
## Max. :22.00
##
## day.length vernalization survival.bolt bolt
## Min. : 8.00 Length:957 Length:957 Length:957
## 1st Qu.: 8.00 Class :character Class :character Class :character
## Median :16.00 Mode :character Mode :character Mode :character
## Mean :12.01
## 3rd Qu.:16.00
## Max. :16.00
##
## days.to.bolt days.to.flower rosette.leaf.num cauline.leaf.num
## Min. : 15.00 Min. : 21.00 Min. : 5.00 Min. : 1.000
## 1st Qu.: 38.00 1st Qu.: 46.00 1st Qu.: 24.00 1st Qu.: 5.000
## Median : 57.00 Median : 66.00 Median : 40.00 Median : 8.000
## Mean : 66.04 Mean : 71.59 Mean : 39.71 Mean : 7.208
## 3rd Qu.: 85.00 3rd Qu.: 92.00 3rd Qu.: 53.00 3rd Qu.: 9.000
## Max. :162.00 Max. :182.00 Max. :112.00 Max. :17.000
## NA's :83 NA's :95 NA's :96
## blade.length.mm total.leaf.length.mm blade.ratio
## Min. : 7.10 Min. : 9.00 Min. :0.0000
## 1st Qu.:18.00 1st Qu.:29.10 1st Qu.:0.5564
## Median :20.95 Median :34.60 Median :0.5948
## Mean :21.11 Mean :34.69 Mean :0.5874
## 3rd Qu.:24.30 3rd Qu.:40.27 3rd Qu.:0.6342
## Max. :59.00 Max. :66.30 Max. :6.5556
## NA's :327 NA's :303 NA's :304
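The NA's rows in this output report how many values are missing in each numeric column. As a sketch of how to get those counts on their own, here is a base R approach applied to a made-up data frame (toy); the same expression should work with expt1 in place of toy.

```r
# Made-up data with some missing values
toy <- data.frame(
  days.to.bolt = c(28, NA, 31),
  blade.length.mm = c(12.9, NA, NA)
)

# is.na() marks the missing cells; colSums() counts them per column
na_per_column <- colSums(is.na(toy))
na_per_column
```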
We have already used many functions:
- install.packages()
- library()
- read_tsv()
- View()
- glimpse()
- summary()
- dim()
It is of course hard to remember the names of all these functions, what they do and how to use them. Fortunately, R has built-in help: type the name of a function preceded by ?
?summary
Of course, searching the internet is also a very effective way to find help!
Challenge: What does the head() function do?
Challenge: How can we look at the last rows of our dataset? (hint: ?tail)
To get an overview of the dataset and spot problems, we will use the vis_dat() function.
vis_dat(expt1)
Challenge: What is the most common data type in the dataset? Are there any problems?
The grey areas in the figure produced by vis_dat() are missing data. Several strategies can be used to deal with them.
For this lesson, we will remove the rows that contain missing data.
expt1 <- drop_na(expt1)
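Note that drop_na() without further arguments removes a row as soon as any of its columns is missing, which also discards the values that were measured in that row. The sketch below, on a made-up data frame (toy), contrasts this with restricting the check to one column; it assumes the tidyr package (part of tidyverse) is installed.

```r
library(tidyr)

# Made-up data: row 2 lacks days.to.bolt, row 3 lacks blade.length.mm
toy <- data.frame(
  genotype = c("Col", "Ler", "Col"),
  days.to.bolt = c(28, NA, 31),
  blade.length.mm = c(12.9, 14.1, NA)
)

# Drop every row that has at least one NA: only row 1 remains
all_complete <- drop_na(toy)

# Drop rows only when days.to.bolt is missing: rows 1 and 3 remain
bolt_complete <- drop_na(toy, days.to.bolt)

nrow(all_complete)
nrow(bolt_complete)
```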
Challenge: How many rows are left?
Now that we have checked the quality of our dataset, we can make graphs to learn more about the data generated by the experiment.
For this we will be using the ggplot2 package, which follows a general scheme termed the “grammar of graphics”. “Grammar of graphics” might sound scary, but just think of it as a set of simple building blocks for a plot. By combining and layering several blocks we can create our dream plot for a dream paper or for a lab meeting.
To build a graph we need several blocks:
Let’s focus on the first three: data, aesthetics and geometric object.
- Aesthetics are defined with the aes() function. Note that each geom_ object understands only a subset of aesthetics. For details, check its help page (e.g. ?geom_line).
- Geometric objects are defined with geom_ functions. Examples include:
  - geom_point (for scatter plots, dot plots)
  - geom_line (for trend lines, time series)
You can find more information about how to build graphs with ggplot2 in this very useful cheatsheet.
Everyone (except Excel) likes boxplots, so we will start by plotting the days.to.flower variable measured for different genotypes.
The ggplot() function initialises a plot. At the very minimum it needs a dataset to plot:
ggplot(expt1)
But this simply produces a blank (well, grey) canvas!
We haven’t told ggplot what aesthetics (this is ggplot2 terminology) we want it to map onto this blank canvas. For a boxplot we need to tell it what our x and y variables are.
ggplot(expt1, aes(x = genotype, y = days.to.flower))
As you can see, ggplot “mapped” the values in the genotype and days.to.flower variables of our table to the x and y aesthetics of the plot.
But this is still quite an empty plot, because we haven’t told ggplot what geometries we want it to draw in the canvas. In our case, we want a boxplot, which we can add on top of the created canvas by adding (literally +) a geom_boxplot():
ggplot(expt1, aes(genotype, days.to.flower)) +
geom_boxplot()
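Because a plot is built by adding blocks, each intermediate step can be stored in a variable and extended later. A minimal sketch on made-up data (toy, canvas and boxes are illustrative names, not part of the lesson data):

```r
library(ggplot2)

# Small made-up data set, only for illustration
toy <- data.frame(
  genotype = rep(c("A", "B"), each = 5),
  days.to.flower = c(40, 42, 43, 45, 47, 60, 62, 65, 66, 70)
)

# Data and aesthetics only: an empty canvas with axes
canvas <- ggplot(toy, aes(x = genotype, y = days.to.flower))

# Adding a geometry with + returns a new plot; canvas itself is unchanged
boxes <- canvas + geom_boxplot()
boxes
```

Printing canvas still shows the empty axes, while boxes shows the boxplots.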
Exercise: can you make a violin plot instead? (hint: ?geom_violin)
Let’s now layer a couple of geom_objects on the same plot. Say, we want to have points for the individual values together with our boxplots:
ggplot(expt1, aes(genotype, rosette.leaf.num)) +
geom_jitter() +
geom_boxplot()
Exercise: can you modify this plot so that the points appear on top of the boxplots rather than behind them?
We can also modify the appearance of our geometry, for example its colour:
ggplot(expt1, aes(genotype, days.to.flower)) +
geom_boxplot(colour = "red")
Or perhaps the colour that fills the boxplots:
ggplot(expt1, aes(genotype, days.to.flower)) +
geom_boxplot(colour = "red", fill = "royalblue")
Or even its transparency:
ggplot(expt1, aes(genotype, days.to.flower)) +
geom_boxplot(colour = "red", fill = "royalblue", alpha = 0.5)
This is all very colourful, but rather gratuitous (what is this colour telling us about the data?!).
What if we wanted to colour our boxplots according to which fluctuation treatment the plants were exposed to? In ggplot2 language, we want to “map” the values of fluctuation onto the colour aesthetic of our plot. This should therefore go inside the aes() part of our graph:
ggplot(expt1, aes(genotype, days.to.flower, colour = fluctuation)) +
geom_boxplot()
Wow! Can you see what ggplot did for you!? It automatically split the data of each genotype into two groups and coloured them accordingly.
Now, let’s say we wanted to visualise the individual data points (not coloured) behind our boxplots (coloured by fluctuation):
ggplot(expt1, aes(genotype, days.to.flower, colour = fluctuation)) +
geom_jitter() +
geom_boxplot(alpha = 0.5)
As it is, the colour aesthetic is mapped to all geometries of the graph. This is because we defined it within the ggplot() function, which affects every geom_object that comes afterwards.
But we can also define aesthetics inside each geometry, for example:
ggplot(expt1, aes(genotype, days.to.flower)) +
geom_jitter() +
geom_boxplot(aes(fill = fluctuation), alpha = 0.5)
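The difference between a global and a per-layer aesthetic can be seen side by side in the sketch below. It uses made-up data (toy); the treatment label "Var" is an assumption made for the illustration.

```r
library(ggplot2)

# Made-up data: two genotypes, two fluctuation treatments
toy <- data.frame(
  genotype = rep(c("A", "B"), each = 6),
  fluctuation = rep(c("Con", "Var"), times = 6),
  days.to.flower = c(40, 44, 41, 46, 43, 47, 60, 66, 61, 68, 63, 69)
)

# colour set in ggplot(): inherited by both the points and the boxplots
p_global <- ggplot(toy, aes(genotype, days.to.flower, colour = fluctuation)) +
  geom_jitter() +
  geom_boxplot(alpha = 0.5)

# colour set inside geom_boxplot(): only the boxplots are coloured
p_local <- ggplot(toy, aes(genotype, days.to.flower)) +
  geom_jitter() +
  geom_boxplot(aes(colour = fluctuation), alpha = 0.5)

p_global
p_local
```

In p_local the jittered points stay black because geom_jitter() never receives a colour mapping.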
Exercise: say we are particularly interested in the relationship between the number of rosette leaves and blade length in mm per genotype.
Visualise this relationship with a scatter plot (geom_point()) between blade.length.mm and rosette.leaf.num and colour the points by genotype. What happens if you colour the points by days.to.bolt?
Often, our data has several grouping variables, and colours alone are not enough to fully represent the differences in the dataset.
For example, the scatterplot produced in the previous exercise is pretty, but very crowded. What if we wanted to isolate each genotype in individual plots?
This is easy to accomplish with ggplot2 by adding a “facet” layer to our plot. There are two types of facets:
- facet_grid() - arranges sub-plots in rows and/or columns
- facet_wrap() - arranges sub-plots in a ribbon that “wraps” around after a fixed number of plots
Let’s start with facet_grid() and see it in action:
ggplot(expt1, aes(blade.length.mm, rosette.leaf.num, colour = genotype)) +
geom_point() +
facet_grid(genotype ~ temperature)
In the code above, we use facet_grid() to define variables that partition our data by rows and columns, using the notation (rows ~ columns).
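To make the rows ~ columns notation concrete, here is a sketch on a small made-up data set (toy) with two genotypes and two temperatures, which gives a 2 x 2 grid of panels. The labeller = label_both argument is optional; it adds the variable name to each panel label.

```r
library(ggplot2)

# Made-up data: every combination of 2 genotypes, 2 temperatures, 3 replicates
toy <- expand.grid(
  genotype = c("A", "B"),
  temperature = c(12, 22),
  replicate = 1:3
)
toy$blade.length.mm <- seq(10, 21, length.out = nrow(toy))
toy$rosette.leaf.num <- rev(seq(15, 48, length.out = nrow(toy)))

# genotype defines the rows, temperature the columns
p <- ggplot(toy, aes(blade.length.mm, rosette.leaf.num)) +
  geom_point() +
  facet_grid(genotype ~ temperature, labeller = label_both)
p
```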
Exercise: In the previous graph, colouring by genotype is redundant with the faceting. Can you think of a more useful way to colour the points?
It is possible to use facet_grid() with a single variable:
# Facet by rows
ggplot(expt1, aes(blade.length.mm, rosette.leaf.num, colour = fluctuation)) +
geom_point() +
facet_grid(genotype ~ .)
# Facet by columns
ggplot(expt1, aes(blade.length.mm, rosette.leaf.num, colour = fluctuation)) +
geom_point() +
facet_grid(. ~ genotype)
When we are only partitioning by one variable, often facet_wrap() produces a better display. For example:
ggplot(expt1, aes(blade.length.mm, rosette.leaf.num, colour = fluctuation)) +
geom_point() +
facet_wrap( ~ genotype)
Exercise: Can you modify the previous graph to facet the data by the fluctuation treatment (as rows) and day.length (as columns) and colour the points by genotype?
In conclusion, by effectively combining facets, colours and other aesthetics you can represent many dimensions of your data in a single graph!
Exercise: Can you produce a graph similar to the target figure (figure not shown)?
Hint: facet the plot by day.length and temperature and fill the boxplots by fluctuation.
But even this is not the limit. We can easily turn our plots into interactive ones using the plotly package.
First we store our plot in a variable and then pass it to the special ggplotly() function.
# Store plot in a variable called p1
p1 <- ggplot(expt1, aes(blade.length.mm, rosette.leaf.num, colour = fluctuation)) +
geom_point() +
facet_wrap(~genotype)
# Render an interactive plot using ggplotly function
ggplotly(p1)
Every element of a ggplot is modifiable. This is out of scope for this module, but here are a few examples and references.
Themes modify the overall appearance of the plot. Some come with ggplot2 and many others can be obtained from other packages such as ggthemes (which also has some additional geom objects).
# Example of built-in ggplot2 themes
ggplot(expt1, aes(genotype, days.to.flower)) +
geom_boxplot() +
theme_bw() +
labs(title = "Black and white theme")
ggplot(expt1, aes(genotype, days.to.flower)) +
geom_boxplot() +
theme_classic() +
labs(title = "Classic theme")
ggplot(expt1, aes(genotype, days.to.flower)) +
geom_boxplot() +
theme_minimal() +
labs(title = "Minimal theme")
The theme() function is used to modify individual elements of the plot. The possibilities are so vast that the easiest way is to do a web-search for your intended purpose.
For example, a web-search for “vertical labels x axis ggplot2” returns as one of the first hits this solution:
ggplot(expt1, aes(genotype, days.to.flower)) +
geom_boxplot() +
theme(axis.text.x = element_text(angle = 90, hjust = 1))
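A theme() tweak like this one can also be stored in a variable and added to several plots. A minimal sketch on made-up data (toy and vertical_x are illustrative names; vjust = 0.5 is an optional extra that centres each label on its tick):

```r
library(ggplot2)

# Made-up data with long genotype names
toy <- data.frame(
  genotype = c("genotype one", "genotype two", "genotype three"),
  days.to.flower = c(43, 90, 50)
)

# Store the theme modification once
vertical_x <- theme(axis.text.x = element_text(angle = 90, hjust = 1, vjust = 0.5))

# Add it to a plot like any other layer
p <- ggplot(toy, aes(genotype, days.to.flower)) +
  geom_point() +
  vertical_x
p
```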
Or searching for “altering plot colours ggplot2” returns this page, which somewhere gives an interesting solution:
ggplot(expt1, aes(genotype, days.to.flower, fill = fluctuation)) +
geom_boxplot() +
scale_fill_brewer(palette="Dark2")
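When a ready-made palette does not suit, colours can also be chosen by hand with scale_fill_manual(). The sketch below uses made-up data (toy); the treatment label "Var" and the two colours are assumptions picked only for illustration.

```r
library(ggplot2)

# Made-up data: two genotypes, two fluctuation treatments
toy <- data.frame(
  genotype = rep(c("A", "B"), each = 6),
  fluctuation = rep(c("Con", "Var"), times = 6),
  days.to.flower = c(40, 44, 41, 46, 43, 47, 60, 66, 61, 68, 63, 69)
)

# Named vector: each treatment level gets an explicit fill colour
p <- ggplot(toy, aes(genotype, days.to.flower, fill = fluctuation)) +
  geom_boxplot() +
  scale_fill_manual(values = c(Con = "grey70", Var = "orange"))
p
```

Naming the values ties each colour to a level, so the assignment does not depend on the order of the levels.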
Based on the principles outlined in this module, try and build a graph of your own dataset using ggplot2.
If you encounter any difficulties, we will discuss them in the next module!
Some other packages that add functionality to ggplot2: